A Hybrid Accurate Alignment Method for Large Persian-English Corpus Construction Based on Statistical Analysis and Lexicon/Persian Word Net

Authors

Mohammad Bagher Dastgheib, Ph.D. Candidate, Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran

Seyed Mostafa Fakhrahmad, Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran

Mansour Zolghadri Jahromi, Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran

Abstract

A bilingual corpus is an important knowledge source and a necessary resource for many natural language processing (NLP) applications that involve two languages. For some languages, such as Persian, the lack of such resources is especially acute. Several applications, including statistical and example-based machine translation, need bilingual corpora in which large amounts of text in two languages have been aligned at the sentence or phrase level. To meet this requirement, this paper proposes an accurate hybrid sentence alignment method for constructing an English-Persian parallel corpus. As the first step, the method uses statistical length-based analysis to filter candidates. Punctuation marks are used as a guiding feature to reduce complexity and increase accuracy. Finally, the method uses lexical knowledge to produce the final output. In the lexical analysis phase, a bilingual dictionary and a Persian semantic net (FarsNet) are used to calculate the extended semantic similarity. Experiments showed that expanding words to their synonyms through extended semantic similarity improves the accuracy of sentence alignment. The matching scheme also uses a semantic-load-based approach, which treats the verb as the pivot and main part of a sentence, to increase accuracy. The experimental results were promising, and the generated parallel corpus can serve as an effective knowledge source for researchers working on the Persian language.
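The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the length-ratio thresholds, and the transliterated toy dictionary and synonym table (a stand-in for FarsNet synsets) are all assumptions made for illustration. The punctuation feature and the verb-pivot weighting mentioned in the abstract are omitted.

```python
from typing import Dict, List, Optional, Set

# Toy resources (hypothetical): English -> Persian (transliterated) translations,
# and a Persian synonym table standing in for FarsNet synsets.
TOY_DICTIONARY: Dict[str, Set[str]] = {
    "he": {"u"},
    "read": {"khand"},
    "book": {"ketab"},
}
TOY_SYNONYMS: Dict[str, Set[str]] = {"motalee": {"khand"}}


def length_ratio_ok(src: str, tgt: str, low: float = 0.5, high: float = 2.0) -> bool:
    """Length-based filter: keep a pair only if the character-length ratio
    falls inside [low, high]. The thresholds are assumed, not from the paper."""
    if not src or not tgt:
        return False
    return low <= len(tgt) / len(src) <= high


def expand(words: List[str], synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Return the words together with all of their listed synonyms."""
    expanded = set(words)
    for word in words:
        expanded |= synonyms.get(word, set())
    return expanded


def lexical_score(src_words: List[str], tgt_words: List[str],
                  dictionary: Dict[str, Set[str]],
                  synonyms: Dict[str, Set[str]]) -> float:
    """Fraction of source words that have a dictionary translation
    in the synonym-expanded target sentence."""
    if not src_words:
        return 0.0
    tgt_expanded = expand(tgt_words, synonyms)
    hits = sum(1 for w in src_words if dictionary.get(w, set()) & tgt_expanded)
    return hits / len(src_words)


def best_match(src: str, candidates: List[str],
               dictionary: Dict[str, Set[str]],
               synonyms: Dict[str, Set[str]]) -> Optional[str]:
    """Filter candidates by length, then pick the highest lexical score."""
    filtered = [c for c in candidates if length_ratio_ok(src, c)]
    if not filtered:
        return None
    src_words = src.lower().split()
    return max(filtered,
               key=lambda c: lexical_score(src_words, c.split(), dictionary, synonyms))


SOURCE = "he read the book"
CANDIDATES = ["u ketab ra motalee kard", "hava sard ast", "bale"]
print(best_match(SOURCE, CANDIDATES, TOY_DICTIONARY, TOY_SYNONYMS))
```

In this toy example the very short candidate is dropped by the length filter, and synonym expansion lets "read" match a target sentence that uses a different verb, which is the kind of effect the abstract attributes to extended semantic similarity.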

Similar Resources

MIZAN: A Large Persian-English Parallel Corpus

One of the most important tasks in natural language processing is machine translation, which is now highly dependent on multilingual parallel corpora. In this paper, we introduce the biggest Persian-English parallel corpus, with more than one million sentence pairs collected from masterpieces of literature. We also present the acquisition process and statistics of the corpus, and expe...

Word Alignment of English-Chinese Bilingual Corpus Based on Chunks

In this paper, a method for the word alignment of an English-Chinese corpus based on chunks is proposed. The chunks of English sentences are identified first. Then the chunk boundaries of Chinese sentences are predicted from the translations of English chunks and heuristic information. The ambiguities of Chinese chunk boundaries are resolved by the coterminous words in English chunks. With the chu...

Supporting Large English-Hindi Parallel Corpus using Word Alignment

This paper describes a methodology for understanding parallel English-Hindi sentences using word alignment. This methodology is the foundation for developing a parallel English-Hindi word dictionary after syntactic and semantic analysis of the English-Hindi source text. The methodology of the proposed system is used for English and Hindi sentences; also the methodology can be used for othe...

A Contrastive Genre Analysis of Persian and English Job Application Letters

Experts in "contrastive rhetoric" across different languages hold that the source language and culture influence how writers write in a second language. Job application letters are no exception to this rule. Based on the principles of "genre" analysis, we can gain insight into a particular "genre" such as job application letters. Despite numerous studies on various aspects of "genre analysis" and ...

Cultural Influence on the Expression of Cathartic Conceptualization in English and Spanish: A Corpus-Based Analysis

This paper investigates the conceptualization of emotional release from a cognitive linguistics perspective (Cognitive Metaphor Theory). The metaphor "weeping is a means of liberating contained emotions" is grounded in universal embodied cognition and is reflected in linguistic expressions in English and Spanish. Lexicalization patterns which encapsulate this conceptualization i...

A Comparative Pragmatic Analysis of the Speech Act of "Disagreement" across English and Persian

The speech act of disagreement has been one of the speech acts that has received the least attention in the field of pragmatics. This study investigates the ways power relations, social distance, formality of the context, gender, and language proficiency (for EFL learners) influence disagreement and politeness strategies. The participants of the study were 200 male and female native Persian s...


Journal:
International Journal of Information Science and Management

Volume 14, Issue 2, Pages 0-0
